How to read many csv files and combine them into a numpy array using multiprocessing

To read many csv files in Python, pandas is usually used to combine all the files into one data frame. However, creating a pandas data frame and assigning to it row by row in a loop is very slow. Instead of building a pandas data frame, preallocating a numpy array and filling it in place is much faster.
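A minimal sketch of the difference, with hypothetical sizes and random data standing in for each file's contents:

import numpy as np
import pandas as pd

n_rows, n_cols = 10_000, 251  # hypothetical sizes

# Slow: assigning into a pandas data frame row by row.
df = pd.DataFrame(index=range(n_rows), columns=range(n_cols))
for i in range(n_rows):
    df.loc[i] = np.random.rand(n_cols)  # stand-in for one file's data

# Fast: preallocate a numpy array and fill rows in place.
arr = np.zeros((n_rows, n_cols))
for i in range(n_rows):
    arr[i, :] = np.random.rand(n_cols)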

To use multiple CPU cores, use multiprocessing.Pool with map or map_async.
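A minimal sketch of the two variants (the work function here is just a stand-in for a per-file task):

import multiprocessing as mp


def work(x):
    return x * x  # stand-in for a per-file task


if __name__ == "__main__":
    with mp.Pool(4) as pool:
        # map blocks until all results are ready and returns them in input order.
        results = pool.map(work, range(10))
        # map_async returns immediately; .get() waits for and returns the results.
        async_result = pool.map_async(work, range(10))
        results_async = async_result.get()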

Below is an example that lists all csv files in a folder, sorts them by name, and reads them with multiprocessing.

On this data set, combining the files in a pandas data frame without multiprocessing takes about 10 minutes; with a preallocated numpy array and multiprocessing, it takes only 23 seconds.

Additionally, using polars instead of pandas to read each file also speeds up the reading itself.
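For a headerless csv, the swap is roughly a one-liner (some_file.csv is a placeholder):

import pandas as pd
import polars as pl

# pandas:
arr = pd.read_csv("some_file.csv", header=None).to_numpy()
# polars equivalent:
arr = pl.read_csv("some_file.csv", has_header=False).to_numpy()

Putting these pieces together, the full example: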

import multiprocessing as mp
import pathlib
import time

import numpy as np
import polars as pl


def read_data(file_path):
    # Read one headerless csv file into a numpy array.
    _data = pl.read_csv(file_path, has_header=False).to_numpy()
    # Data-specific fix-up: pull the second row's values up into the first row.
    _data[0, 1:] = _data[1, 1:]
    _time = _data[0, 0]
    return _time, _data[0, 1:]


if __name__ == "__main__":
    files = list(pathlib.Path("./data_original/1").glob("*.csv"))
    # Sort numerically by the index embedded in the file name
    # (the [16:-4] slice is specific to this naming scheme).
    sorted_path = sorted(files, key=lambda name: int(str(name)[16:-4]))
    spectra_all = np.zeros((len(files), 251))  # preallocate: 251 values per file
    time_start = time.time()
    pool = mp.Pool(8)  # 8 worker processes
    result = pool.map_async(read_data, sorted_path)
    pool.close()
    print(time.time() - time_start)  # map_async returns immediately, so this is small
    results = result.get()  # blocks until all workers are done; call it once
    pool.join()
    time_list = [None] * len(files)
    for i in range(len(files)):
        if i % 100 == 0:
            print(i)  # progress indicator
        time_list[i], spectra_all[i, :] = results[i]
    print(time.time() - time_start)
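Note that map_async returns immediately, so the first timing print mostly measures job submission. It is result.get() that blocks until every worker has finished and returns the full list of (time, row) tuples; calling it once before the loop keeps the blocking point explicit instead of hiding it in the first loop iteration.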